acceptance criteria
Evidence-Bound Autonomous Research (EviBound): A Governance Framework for Eliminating False Claims
LLM-based autonomous research agents report false claims: tasks marked "complete" despite missing artifacts, contradictory metrics, or failed executions. EviBound is an evidence-bound execution framework that eliminates false claims through dual governance gates requiring machine-checkable evidence. Two complementary gates enforce evidence requirements. The pre-execution Approval Gate validates acceptance criteria schemas before code runs, catching structural violations proactively. The post-execution Verification Gate validates artifacts via MLflow API queries (with recursive path checking) and optionally validates metrics when specified by acceptance criteria. Claims propagate only when backed by a queryable run ID, required artifacts, and FINISHED status. Bounded, confidence-gated retries (typically 1-2 attempts) recover from transient failures without unbounded loops. The framework was evaluated on 8 benchmark tasks spanning infrastructure validation, ML capabilities, and governance stress tests. Baseline A (Prompt-Level Only) yields 100% hallucination (8/8 claimed, 0/8 verified). Baseline B (Verification-Only) reduces hallucination to 25% (2/8 fail verification). EviBound (Dual Gates) achieves 0% hallucination: 7/8 tasks verified and 1 task correctly blocked at the approval gate, all with only approximately 8.3% execution overhead. This package includes execution trajectories, MLflow run IDs for all verified tasks, and a 4-step verification protocol. Research integrity is an architectural property, achieved through governance gates rather than emergent from model scale.
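The Verification Gate described above can be sketched as a small check over run records. This is a minimal illustration, not the paper's implementation: the in-memory `store` dict and `RunRecord` dataclass stand in for an MLflow tracking server, and the artifact check is a flat set-membership test rather than a true recursive listing.

```python
from dataclasses import dataclass, field

@dataclass
class RunRecord:
    """Stand-in for an MLflow run; a real gate would query MlflowClient."""
    status: str                                   # e.g. "FINISHED", "FAILED"
    artifacts: set = field(default_factory=set)   # artifact paths found for the run

def verification_gate(store, run_id, required_artifacts):
    """Post-execution gate: a claim propagates only if the run ID is
    queryable, the run status is FINISHED, and every required artifact exists."""
    run = store.get(run_id)
    if run is None:
        return False, "run ID not queryable"
    if run.status != "FINISHED":
        return False, f"status is {run.status}, not FINISHED"
    missing = [a for a in required_artifacts if a not in run.artifacts]
    if missing:
        return False, f"missing artifacts: {missing}"
    return True, "claim verified"
```

With MLflow installed, the `store.get(run_id)` lookup would become `MlflowClient().get_run(run_id)` and the membership test a recursive walk via `client.list_artifacts(run_id, path)`.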
Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence
Sharrock, Callum, Petersson, Lukas, Petersson, Hanna, Backlund, Axel, Wennström, Axel, Nordström, Kristoffer, Aronsson, Elias
We present Butter-Bench, a benchmark evaluating large language model (LLM) controlled robots for practical intelligence, defined as the ability to navigate the messiness of the physical world. Current state-of-the-art robotic systems use a hierarchical architecture with LLMs in charge of high-level reasoning and a Vision Language Action (VLA) model for low-level control. Butter-Bench evaluates the LLM part in isolation from the VLA. Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench. The best LLMs score 40% on Butter-Bench, while the mean human score is 95%. LLMs struggled the most with multi-step spatial planning and social understanding. We also evaluate LLMs that are fine-tuned for embodied reasoning and conclude that this training does not improve their score on Butter-Bench.

Language models (LMs) were initially intended for narrow text understanding tasks. The first Transformer-based LM (Vaswani et al., 2017) was explicitly trained for translation. However, large-scale training runs of LMs eventually resulted in emergent behaviour: model capabilities that were not explicitly trained for (Brown et al., 2020). For example, LLMs are not trained to be robots, yet companies such as Figure (Helix, 2025) and Google DeepMind (Gemini Robotics 1.5, 2025) use LLMs in their robotic stack.
Multi-Agent LLMs as Ethics Advocates for AI-Based Systems
Yamani, Asma, Baslyman, Malak, Ahmed, Moataz
Incorporating ethics into the requirements elicitation process is essential for creating ethically aligned systems. Although manual elicitation of ethics requirements is effective, it requires diverse input from multiple stakeholders, which can be challenging due to time and resource constraints. Moreover, ethics is often given low priority in the requirements elicitation process. This study proposes a framework for generating draft ethics requirements by introducing an ethics advocate agent into a multi-agent LLM setting. This agent critiques and provides input on ethical issues based on the system description. The proposed framework is evaluated through two case studies from different contexts, demonstrating that it captures the majority of the ethics requirements identified by researchers during 30-minute interviews and introduces several additional relevant requirements. However, the evaluation also highlights reliability issues in generating ethics requirements, emphasizing the need for human feedback in this sensitive domain. We believe this work can facilitate the broader adoption of ethics in the requirements engineering process, ultimately leading to more ethically aligned products.

Artificial intelligence (AI) has gained widespread adoption across various domains, including healthcare, finance, education, and marketing.
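The advocate pattern the abstract describes, where one agent drafts requirements and another critiques them on ethical grounds, can be outlined as a simple loop. This is a hypothetical sketch: the `llm` callable, the prompt wording, and the round count are all placeholders, not the paper's actual framework.

```python
def draft_ethics_requirements(system_description, llm, rounds=2):
    """Hypothetical advocate loop: an elicitor drafts requirements and an
    ethics advocate critiques the draft; the draft is revised each round."""
    draft = llm(f"Draft requirements for: {system_description}")
    for _ in range(rounds):
        critique = llm(f"As an ethics advocate, critique: {draft}")
        draft = llm(f"Revise the draft {draft!r} to address: {critique}")
    return draft
```

In practice, `llm` would wrap a chat-completion call and the final draft would still be reviewed by a human, per the paper's own reliability caveat.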
Comparison of Data Reduction Criteria for Online Gaussian Processes
Wietzke, Thore, Graichen, Knut
Gaussian Processes (GPs) are widely used for regression and system identification due to their flexibility and ability to quantify uncertainty. However, their computational complexity limits their applicability to small datasets. Moreover, in a streaming scenario, datapoints accumulate continually, which becomes intractable even for sparse GPs. Online GPs aim to alleviate this problem by, e.g., defining a maximum budget of datapoints and removing redundant datapoints. This work provides a unified comparison of several reduction criteria, analyzing both their computational complexity and reduction behavior. The criteria are evaluated on benchmark functions and real-world datasets, including dynamic system identification tasks. Additionally, acceptance criteria are proposed to further filter out redundant datapoints. This work yields practical guidelines for choosing a suitable criterion for an online GP algorithm.
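The budget-plus-acceptance idea can be illustrated with a toy online GP. This sketch is only one plausible instantiation: the acceptance criterion (reject a point whose prior predictive variance at that input is below a threshold) and the FIFO removal rule are stand-ins, not the specific criteria the paper compares.

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    """Squared-exponential kernel with unit signal variance."""
    sqdist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sqdist / lengthscale**2)

class OnlineGP:
    """Toy online GP with a datapoint budget and a variance-based
    acceptance criterion for incoming streaming data."""

    def __init__(self, budget=20, noise=1e-2, var_threshold=0.05):
        self.budget = budget
        self.noise = noise
        self.var_threshold = var_threshold
        self.X = np.empty((0, 1))
        self.y = np.empty(0)

    def predictive_variance(self, x):
        if len(self.y) == 0:
            return 1.0  # prior variance before any data
        K = rbf(self.X, self.X) + self.noise * np.eye(len(self.y))
        k = rbf(self.X, x)
        return 1.0 - (k.T @ np.linalg.solve(K, k)).item()

    def maybe_add(self, x, y):
        x = np.atleast_2d(x)
        if self.predictive_variance(x) < self.var_threshold:
            return False  # acceptance criterion: point is redundant
        self.X = np.vstack([self.X, x])
        self.y = np.append(self.y, y)
        if len(self.y) > self.budget:
            # placeholder removal rule (drop oldest); the criteria compared
            # in the paper weigh information loss instead
            self.X, self.y = self.X[1:], self.y[1:]
        return True
```

A point near an existing datapoint is rejected (low predictive variance), while a distant, informative point is accepted.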
Private GPTs for LLM-driven testing in software development and machine learning
Jagielski, Jakub, Rojas, Consuelo, Abel, Markus
In this contribution, we examine the capability of private GPTs to automatically generate executable test code from requirements. More specifically, we use acceptance criteria as input, formulated as part of epics or stories, which are typically used in modern development processes. This gives product owners or business-intelligence teams a way to produce testable criteria directly through the use of LLMs. We explore the quality of the tests produced in two ways: i) directly, by letting the LLM generate code from requirements, and ii) through an intermediate step using Gherkin syntax. It turns out that the two-step procedure yields better results, where we define "better" in terms of human readability and best coding practices, i.e., lines of code and the use of additional libraries typically used in testing. Concretely, we evaluate prompt effectiveness across two scenarios: a simple "Hello World" program and a digit classification model, showing that structured prompts lead to higher-quality test outputs.
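The two-step procedure can be made concrete with the paper's own "Hello World" scenario. The Gherkin text and the derived test below are illustrative of the shape of each step, not output from the authors' private GPTs.

```python
# Step 1: an acceptance criterion expressed in Gherkin (illustrative example).
GHERKIN = """
Feature: Greeting
  Scenario: Program prints a greeting
    Given the program has started
    When it runs to completion
    Then the output is "Hello, World!"
"""

# Step 2: the kind of pytest-style test an LLM might derive from the scenario.
def hello_world():
    return "Hello, World!"

def test_output_is_greeting():
    # Then-clause of the scenario, checked directly against the program output
    assert hello_world() == "Hello, World!"
```

The intermediate Gherkin step pins down Given/When/Then structure before any code is generated, which is one plausible reason the two-step route produced more readable tests.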
Automated Behaviour-Driven Acceptance Testing of Robotic Systems
Nguyen, Minh, Wrede, Sebastian, Hochgeschwender, Nico
The specification and validation of robotics applications require bridging the gap between formulating requirements and systematic testing. This often involves manual and error-prone tasks that become more complex as requirements, design, and implementation evolve. To address this challenge systematically, we propose extending behaviour-driven development (BDD) to define and verify acceptance criteria for robotic systems. In this context, we use domain-specific modelling and represent composable BDD models as knowledge graphs for robust querying and manipulation, facilitating the generation of executable testing models. A domain-specific language helps to efficiently specify robotic acceptance criteria. We explore the potential for automated generation and execution of acceptance tests through a software architecture that integrates a BDD framework, Isaac Sim, and model transformations, focusing on acceptance criteria for pick-and-place applications. We tested this architecture with an existing pick-and-place implementation and evaluated the execution results, showing how the application behaves and fails differently when tested against variations of the agent and environment. This research advances the rigorous and automated evaluation of robotic systems, contributing to their reliability and trustworthiness.
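Representing a BDD scenario as a knowledge graph can be pictured as subject-predicate-object triples that are then queried to assemble an executable test. The triples, predicate names, and pick-and-place clauses below are invented for illustration; the paper's actual models and DSL are richer.

```python
# A hypothetical pick-and-place scenario stored as subject-predicate-object triples.
TRIPLES = [
    ("scenario:pick_place", "rdf:type",  "bdd:Scenario"),
    ("scenario:pick_place", "bdd:given", "robot is at the home pose"),
    ("scenario:pick_place", "bdd:when",  "robot picks the cube"),
    ("scenario:pick_place", "bdd:then",  "cube is at the target location"),
]

def clauses(graph, scenario, predicate):
    """Query the graph for all clauses of one kind (Given/When/Then)
    belonging to a scenario; a test generator would iterate these."""
    return [o for s, p, o in graph if s == scenario and p == predicate]
```

A generator walking the graph would turn each `bdd:then` clause into an assertion executed against the simulated robot.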
AI-Driven Tools in Modern Software Quality Assurance: An Assessment of Benefits, Challenges, and Future Directions
Pysmennyi, Ihor, Kyslyi, Roman, Kleshch, Kyrylo
Traditional quality assurance (QA) methods face significant challenges in addressing the complexity, scale, and rapid iteration cycles of modern software systems and are strained by the limited resources available, leading to substantial costs associated with poor quality. The object of this research is the quality assurance process for modern distributed software applications. The subject of the research is the assessment of the benefits, challenges, and prospects of integrating modern AI-oriented tools into quality assurance processes. We performed a comprehensive analysis of the implications for both verification and validation processes, covering exploratory test analysis, equivalence partitioning and boundary analysis, metamorphic testing, finding inconsistencies in acceptance criteria (AC), static analysis, test case generation, unit test generation, test suite optimization and assessment, and end-to-end scenario execution. An end-to-end regression of a sample enterprise application, driven by AI agents over generated test scenarios, was implemented as a proof of concept highlighting the practical use of the study. The results, with only 8.3% flaky executions of generated test cases, indicate significant potential for the proposed approaches. However, the study also identified substantial challenges to practical adoption: generating semantically identical coverage, the "black box" nature and lack of explainability of state-of-the-art Large Language Models (LLMs), and their tendency to correct mutated test cases to match expected results, underscoring the necessity of thoroughly verifying both generated artifacts and test execution results. The research demonstrates AI's transformative potential for QA but highlights the importance of a strategic approach to implementing these technologies, considering the identified limitations and the need to develop appropriate verification methodologies.
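Of the techniques listed, metamorphic testing is the easiest to show in miniature: instead of an oracle for the exact output, one checks relations that must hold across transformed inputs. The sort-routine relations below are a generic textbook example, not one of the study's generated test cases.

```python
def metamorphic_check(f, xs):
    """Metamorphic relations for a sort routine: permuting the input
    must not change the output, and the output must be ordered."""
    out = f(xs)
    assert out == f(list(reversed(xs)))               # permutation invariance
    assert all(a <= b for a, b in zip(out, out[1:]))  # sortedness
    return out
```

Relations like these are attractive targets for LLM generation precisely because they avoid hard-coding expected outputs, which is where the "correcting mutated test cases" failure mode bites.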
APE: Selective Fine-tuning with Acceptance Criteria for Language Model Adaptation
Adapting large pre-trained language models to specific tasks requires balancing performance improvement with preservation of learned capabilities. Standard fine-tuning approaches optimize a single objective function through gradient descent, often leading to catastrophic forgetting [16] or instability in learned representations. Parameter-efficient methods like LoRA [11] constrain modifications to low-dimensional subspaces but limit adaptation scope. We propose Adjacent Possible Exploration (APE), a selective fine-tuning approach that explores multiple parameter modification directions while implementing acceptance criteria to maintain model stability. The method draws conceptual inspiration from evolutionary optimization principles, particularly the biological constraint that viable changes must preserve essential system properties while enabling incremental improvement. APE operates by generating multiple candidate parameter updates through fine-tuning on randomly sampled data subsets, then selecting only those updates that exceed a performance improvement threshold. This creates a filtered optimization process that systematically explores beneficial parameter modifications while rejecting changes that fall within noise levels or could destabilize learned representations. Our key contributions include: (1) a practical algorithm for selective fine-tuning that balances exploration and stability, (2) empirical validation showing superior performance compared to standard adaptation methods, and (3) analysis of why selective acceptance of parameter modifications leads to more robust model adaptation. The approach demonstrates that systematic exploration of parameter space through filtered selection can achieve better adaptation results than unconstrained optimization, providing a principled framework for controlled model modification that maintains stability while enabling significant performance improvements.
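The generate-then-filter loop APE describes (candidate updates from random data subsets, accepted only above an improvement threshold) can be sketched on a toy least-squares problem. The subset size, learning rate, and threshold here are arbitrary illustrative values, and the toy objective stands in for real language-model fine-tuning.

```python
import numpy as np

def loss(w, X, y):
    """Mean squared error of a linear model; stands in for task loss."""
    return float(np.mean((X @ w - y) ** 2))

def ape_step(w, X, y, rng, n_candidates=5, subset=0.5, lr=0.1, threshold=1e-4):
    """One APE-style iteration: propose candidate updates by taking a
    gradient step on a random data subset, then accept only the best
    candidate whose full-data improvement exceeds `threshold`."""
    base = loss(w, X, y)
    best_w, best_loss = w, base
    for _ in range(n_candidates):
        idx = rng.choice(len(y), size=int(subset * len(y)), replace=False)
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        cand = w - lr * grad
        cand_loss = loss(cand, X, y)
        # acceptance criterion: must beat the current loss by the threshold
        if base - cand_loss > threshold and cand_loss < best_loss:
            best_w, best_loss = cand, cand_loss
    return best_w, best_loss
```

Because an update is kept only when it measurably improves the full objective, the loss never increases across iterations, which is the stability property the acceptance criterion buys.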
AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology
Nguyen, Minh Huynh, Chau, Thang Phan, Nguyen, Phong X., Bui, Nghi D. Q.
Software agents have emerged as promising tools for addressing complex software engineering tasks. However, existing works frequently oversimplify software development workflows, even though such workflows are typically more complex in the real world. We therefore propose AgileCoder, a multi-agent system that integrates Agile Methodology (AM) into its framework. The system assigns specific AM roles, such as Product Manager, Developer, and Tester, to different agents, who then collaboratively develop software based on user inputs. AgileCoder enhances development efficiency by organizing work into sprints, incrementally developing the software sprint by sprint. Additionally, we introduce the Dynamic Code Graph Generator, a module that creates a Code Dependency Graph dynamically as updates are made to the codebase. This allows agents to better comprehend the codebase, leading to more precise code generation and modifications throughout the software development process. AgileCoder surpasses existing systems such as ChatDev and MetaGPT on established benchmarks, setting a new standard and showcasing the capabilities of multi-agent systems in advanced software engineering environments.
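The core of a code dependency graph, of the kind the Dynamic Code Graph Generator maintains, can be sketched with the standard-library `ast` module: parse each module and record which modules it imports. This is a minimal stand-in; AgileCoder's actual module is described as updating the graph dynamically as agents edit the codebase.

```python
import ast

def module_deps(source, name):
    """Extract the modules a Python source file imports
    (one node's outgoing edges in the dependency graph)."""
    deps = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            deps.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            deps.add(node.module)
    return name, sorted(deps)

def build_graph(files):
    """files: {module_name: source_text}. Returns an adjacency dict
    mapping each module to the modules it depends on."""
    return dict(module_deps(src, mod) for mod, src in files.items())
```

An agent asked to modify `utils` could consult the graph to see that `app` depends on it and re-test both, which is the comprehension benefit the paper attributes to the graph.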
Stock Price Prediction using Dynamic Neural Networks
This paper analyzes and implements a time-series dynamic neural network to predict daily closing stock prices. Neural networks possess unsurpassed abilities in identifying underlying patterns in chaotic, non-linear, and seemingly random data, thus providing a mechanism to predict stock price movements more precisely than many current techniques. Contemporary methods for stock analysis, including fundamental, technical, and regression techniques, are discussed and compared with the performance of neural networks. Also, the Efficient Market Hypothesis (EMH) is presented and contrasted with chaos theory using neural networks. This paper refutes the EMH and supports chaos theory. Finally, recommendations for using neural networks in stock price prediction are presented.